In the present document, we aim to conduct a systematic review of the literature on the short-term health effects of air pollution. The objective is twofold:

- Retrieve effect sizes and confidence intervals in order to compute power, type M and type S errors in the literature
- Get a sense of the proportion of papers in this literature discussing power and missing data issues
In this section, we discuss the importance of such an analysis. (Yet to be written)
In this section, we implement robustness tests in order to compute the power, type M and type S errors in the studied articles. We examine what the power, type M and type S errors would be if the true effect were a fraction of the measured effect. We retrieved the estimates and confidence intervals of the articles in the literature of interest in another document. Before turning to the power analysis itself, we look at the characteristics of the articles considered.
We retrieved the articles using the following query:
'TITLE(("air pollution" OR "air quality" OR "particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide") AND ("emergency" OR "mortality") AND NOT ("long term" OR "long-term")) AND ("particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide")'
This query returns 1649 articles. Based on the abstracts, we can briefly explore the main (unsurprising) themes of the articles:
Out of all the articles returned by the query, 700 mention confidence intervals in their abstract (“CI”, “confidence interval”, etc.):
In these articles, we retrieve valid effects and confidence intervals in the following proportions:1
| Effect retrieved | Number of articles | Proportion |
|---|---|---|
| Yes | 592 | 0.8457143 |
| No | 108 | 0.1542857 |
This corresponds to 1858 valid effects and associated confidence intervals.
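To give a concrete idea of the retrieval step, here is a minimal sketch of the kind of pattern matching that can pull point estimates and 95% confidence intervals out of an abstract. The regular expression and the example abstract below are illustrative, not the exact ones used in the retrieval document:

```r
library(stringr)

# Illustrative pattern: "estimate (95% CI: lower, upper)"; the actual
# expression used in the retrieval step may differ.
ci_pattern <- "(\\d+\\.?\\d*)\\s*\\(95%\\s*CI[:,]?\\s*(\\d+\\.?\\d*)\\s*[,;-]\\s*(\\d+\\.?\\d*)\\)"

abstract <- "Each 10 ug/m3 increase in PM2.5 was associated with a relative
risk of 1.04 (95% CI: 1.01, 1.07) for all-cause mortality."

# Columns: full match, point estimate, lower bound, upper bound
str_match_all(abstract, ci_pattern)[[1]]
```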
Here is a random example of the effects and confidence intervals detected by our method (highlighted in gray):
In this subsection, we investigate whether there are systematic differences between articles displaying an effect that we detected in the abstract and articles that do not display an effect or for which we did not detect the effect.
We first wonder whether there are disparities in publication dates. It might be the case that displaying effects in the abstract was a feature of a given period.
Even though there are slightly more recent (2010-2020) articles for which effects are retrieved, the difference does not seem to be substantial.
We also investigate whether there are differences in the journals in which the articles are published.
For this analysis to be informative, we would need to cluster the journals into groups (e.g., epidemiology journals, general science journals, etc.).
Then, we wonder whether the themes considered in each type of abstract differ.
Apart from a few key terms, such as “CI” or “95”, there are no large variations in the themes.
We do not seem to detect effects more often for one pollutant than for another. Note that if an article considers several pollutants, it appears several times in this graph.
Now that we have briefly compared the articles for which we retrieve an effect and those for which we do not, we can dig further into the analysis of the estimates retrieved.
In this section, we briefly analyse the effects retrieved. First, we look into the proportion of effects which are significant.
Unsurprisingly, most of the effects retrieved here are significant: these are effects that authors chose to report in their abstracts, together with confidence intervals.
We then look into the distribution of the t-scores.
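Since only the point estimates and the bounds of the 95% confidence intervals are retrieved, the t-scores have to be backed out. A minimal sketch, assuming a data frame `effects` with hypothetical columns `estimate`, `ci_lower` and `ci_upper`, all on a common scale:

```r
# Back out the standard error from the 95% CI, then the t-score.
effects$se <- (effects$ci_upper - effects$ci_lower) / (2 * qnorm(0.975))
effects$t  <- effects$estimate / effects$se

hist(effects$t, breaks = 100, xlab = "t-score", main = "")
```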
We notice some bunching of t-scores above 1.96. We might need to investigate further whether this is evidence of publication bias. Yet, our analysis itself might be biased to some extent, since we only consider estimates reported in abstracts. Authors may choose to report in the abstract only statistically significant estimates, even when non-significant estimates appear in the body of the article. We could investigate this further by reproducing the present analysis on the full texts rather than on the abstracts alone.
We then plot the distribution of the signal-to-noise ratio, i.e., the ratio of the point estimate to the width of the confidence interval.
The graph is of course analogous to the previous one. It does, however, show that in a large share of the studies, the magnitude of the noise is larger than the magnitude of the effect. Looking in more detail into the distribution of the signal-to-noise ratio, we notice that for 40% of the estimates considered here, the magnitude of the noise exceeds that of the signal.
| Quantile | Signal-to-noise ratio |
|---|---|
| 0% | 0.0322581 |
| 10% | 0.5384615 |
| 20% | 0.6607381 |
| 30% | 0.8277522 |
| 40% | 1.0289348 |
| 50% | 1.3533073 |
| 60% | 2.2766337 |
| 70% | 4.7265455 |
| 80% | 10.0446281 |
| 90% | 24.8030488 |
| 100% | 834.8333333 |
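For reference, a quantile table like the one above can be produced along these lines, reusing the hypothetical `effects` data frame introduced earlier:

```r
# Signal-to-noise ratio: absolute point estimate over the 95% CI width.
effects$snr <- abs(effects$estimate) / (effects$ci_upper - effects$ci_lower)

quantile(effects$snr, probs = seq(0, 1, by = 0.1))
```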
We then turn to the power analysis itself. To do so, we use the retrodesign package, which computes post-analysis design calculations (power, type M and type S errors). We run retro_design for several hypothesized true effect sizes.
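For transparency, here is a minimal sketch of the quantities retro_design computes, following Gelman and Carlin (2014). The standard error `s` is the one backed out of the reported confidence interval, and `A` is the hypothesized true effect (a fraction of the measured effect):

```r
# Power, type S and type M errors for a true effect A measured with
# standard error s, following Gelman and Carlin (2014). This mirrors
# what retrodesign::retro_design() computes.
retro_sketch <- function(A, s, alpha = 0.05, n_sims = 10000) {
  z <- qnorm(1 - alpha / 2)
  p_pos <- 1 - pnorm(z - A / s)   # significant and positive
  p_neg <- pnorm(-z - A / s)      # significant and negative
  power <- p_pos + p_neg
  type_s <- p_neg / power         # wrong-sign share among significant results
  # Type M (exaggeration ratio): mean |estimate| among significant draws
  draws <- rnorm(n_sims, mean = A, sd = s)
  sig <- abs(draws) > z * s
  type_m <- mean(abs(draws[sig])) / A
  list(power = power, type_s = type_s, type_m = type_m)
}

# Example: estimate 1.2 with 95% CI (0.1, 2.3), true effect assumed
# to be half of the measured effect.
s_hat <- (2.3 - 0.1) / (2 * qnorm(0.975))
retro_sketch(A = 0.5 * 1.2, s = s_hat)
```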
In a first part, we carry out our analysis on the whole set of articles. We notice some heterogeneity across articles, with some displaying high power and others displaying lower power. Thus, in a second part, we look in more detail at the articles displaying low power.
We start by computing the average and median power, type M and type S errors.
| True effect (fraction of measured effect) | Power (mean) | Power (median) | Type M (mean) | Type M (median) | Type S (mean) | Type S (median) |
|---|---|---|---|---|---|---|
| 0.01 | 0.1043085 | 0.0503224 | 55.922046 | 44.084318 | 0.3385168 | 0.4383054 |
| 0.05 | 0.2526895 | 0.0580982 | 11.339701 | 8.886364 | 0.1895556 | 0.2243359 |
| 0.10 | 0.3424199 | 0.0828139 | 5.831147 | 4.523520 | 0.1087653 | 0.0770272 |
| 0.33 | 0.5488371 | 0.4172006 | 2.093454 | 1.534451 | 0.0141896 | 0.0002478 |
| 0.50 | 0.6629577 | 0.7556944 | 1.579196 | 1.156594 | 0.0054233 | 0.0000026 |
| 0.67 | 0.7570693 | 0.9445702 | 1.345587 | 1.033339 | 0.0029559 | 0.0000000 |
| 0.75 | 0.7937769 | 0.9782421 | 1.277935 | 1.013360 | 0.0023906 | 0.0000000 |
| 0.90 | 0.8500446 | 0.9975569 | 1.190855 | 1.001596 | 0.0017246 | 0.0000000 |
| 1.00 | 0.8794131 | 0.9995884 | 1.151620 | 1.000280 | 0.0014349 | 0.0000000 |
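A table like the one above can be obtained by applying the calculation to every estimate, for each hypothesized fraction of the measured effect. A sketch, reusing `retro_sketch` and the hypothetical `effects` data frame (taking the sign of each estimate as the direction of its true effect):

```r
fractions <- c(0.01, 0.05, 0.1, 0.33, 0.5, 0.67, 0.75, 0.9, 1)

summary_by_fraction <- t(sapply(fractions, function(f) {
  # 3 x n matrix with rows "power", "type_s", "type_m"
  out <- mapply(function(a, s) unlist(retro_sketch(A = f * a, s = s)),
                abs(effects$estimate), effects$se)
  c(fraction      = f,
    power_mean    = mean(out["power", ]),  power_median  = median(out["power", ]),
    type_m_mean   = mean(out["type_m", ]), type_m_median = median(out["type_m", ]),
    type_s_mean   = mean(out["type_s", ]), type_s_median = median(out["type_s", ]))
}))
summary_by_fraction
```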
Then, we look at the distribution of power, type M and type S errors across estimates and for different assumed sizes of the true effect.
A large share of articles display high power and low rates of type M and type S errors in each robustness check. However, a non-negligible number of articles display lower power and/or some evidence of type M error. Type S errors do not seem to be an important issue here. We investigate potential causes of the lack of power and of the type M errors in the next subsection.
Note that, due to some outliers, we used a log scale for type M errors. Below, we drop the log scale and restrict our sample to type M errors lower than 2.5, which covers 95% of our sample even when the true effect is assumed to be one third of the measured effect.
We find that, even if the measured effect is the true effect, there is some risk of type M error.
Then, we look at how type M and type S errors evolve with power in the estimates considered.
There is a one-to-one relationship between power and type M and type S errors: all three quantities depend on an estimate only through the ratio of the hypothesized true effect to the standard error. Not surprisingly, type M and type S errors skyrocket in studies with low power.
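This relationship can be traced out directly by evaluating the calculation over a grid of effect-to-standard-error ratios. A sketch, reusing `retro_sketch`:

```r
# Trace type M against power over a grid of A/s ratios.
ratios <- seq(0.05, 5, by = 0.05)
grid <- t(sapply(ratios, function(r) unlist(retro_sketch(A = r, s = 1))))

plot(grid[, "power"], grid[, "type_m"], type = "l", log = "y",
     xlab = "Power", ylab = "Type M error (exaggeration ratio)")
```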
We then investigate how power, type M and type S errors evolve with the size of the true effect, expressed as a proportion of the measured effect.
Power, type M and type S errors also skyrocket for small values of the true effect (as a proportion of the measured effect). In addition, on average, if for each paper in the literature the true effect were only three quarters of the measured effect, power would be below the usual 80% threshold. Type S errors only seem to be an issue for small values of the true effect as a proportion of the measured effect. Type M errors seem to be more consistently problematic. The steep rise in the previous graph makes it difficult to read the values of the type M error when the true effect is not a small proportion of the measured effect. We therefore zoom in.
We notice that, on average in the literature, the treatment effects are overestimated, even for large values of the true effect. This result might be driven by some outliers. We therefore look into the evolution of the median, rather than the mean, type M error with true effect size.
We notice that the issue is much less pronounced for the median. This suggests some heterogeneity in terms of power across the literature.
It might also be interesting to look at how power, type M and type S errors evolve over time, i.e., with publication date.
There does not seem to be a clear trend in the evolution of power and type S errors. However, the type M error seems to have peaked in the 2010s and to be decreasing again in recent years.
In the previous section, we noticed that a non-negligible number of studies seemed to suffer from low power and the associated type M error. We classify as low-powered the estimates for which power is lower than 80% when the true effect is assumed to be 3/4 of the measured effect. 80% is the threshold usually used in power analyses; 3/4 is arbitrary and could easily be changed in a robustness check. Following this criterion, the number and proportion of estimates with low power is as follows:
| Power | Number of estimates | Proportion |
|---|---|---|
| Adequate power | 1179 | 0.6345533 |
| Low power | 679 | 0.3654467 |
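The classification above can be reproduced with the same building blocks, again using the hypothetical `effects` data frame:

```r
# Flag estimates whose power falls below 80% when the true effect is
# assumed to be 3/4 of the measured effect.
effects$power_3_4 <- mapply(function(a, s) retro_sketch(A = 0.75 * a, s = s)$power,
                            abs(effects$estimate), effects$se)
effects$low_power <- effects$power_3_4 < 0.8

prop.table(table(effects$low_power))
```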
We investigate the particularities of the articles with low power. We start by reproducing the analyses used to compare the articles for which we retrieved an effect with those for which we did not. First, we look into the distribution of publication dates.
It seems that fewer articles with low power have been published recently, compared to articles with adequate power. This confirms our previous finding. We then look into the distribution of articles across journals.
Interestingly, some journals, such as “Science of the Total Environment”, the “International Journal of Occupational Medicine and Environmental Health”, the “Cochrane Database of Systematic Reviews”, “Environmental Science and Pollution Research” and the “Journal of Exposure Science and Environmental Epidemiology”, publish a large share of low-power studies. On the contrary, BMJ Open publishes very few low-power studies.
Here also, grouping the journals into broad thematic groups could be more instructive.
We also look into disparities by pollutant type.
There does not seem to be stark differences by pollutant type.
Note that a number of abstracts contain the term “CI” without actually displaying effects and confidence intervals.↩︎